In this assessment, you need to answer all the questions about KNN, Linear Regression, Regularization, Logistic Regression, K-fold cross-validation, and other concepts covered in Modules 1-3. RStudio is recommended for completing your assessment. All code needs comments to help markers understand your ideas; if no comments are given, you may receive a 10% deduction on your mark. Please refer to the weekly activities for examples of how to write comments. After answering all the questions, knit your R notebook file to HTML or PDF format. Submit both the .Rmd file and the .html or .pdf file to the Assessment 1 dropbox via the link on the Assessment page. You can compress your files into a zip file for submission. The total mark for this assessment is 100, which is worth 30% of your final result.
# import libraries
library(ggplot2)
library(reshape)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:reshape':
##
## rename
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(grid)
library(gtable)
library(stringr)
library(caret)
## Loading required package: lattice
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(PerformanceAnalytics)
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
library(AICcmodavg)
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:reshape':
##
## expand
## Loaded glmnet 4.1-2
library(corrplot)
## corrplot 0.90 loaded
Hint: please review all reading materials in Modules 1-3 carefully, especially the activities.
In this question, you are required to implement a KNN classifier to predict the class of cancer clumps. The breast cancer dataset is used in this question. A detailed description of this data set can be found at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29.
Specifically, you need to:
## 'data.frame': 699 obs. of 10 variables:
## $ clumpThickness : int 5 5 3 6 4 8 1 2 2 4 ...
## $ sizeUniformity : int 1 4 1 8 1 10 1 1 1 2 ...
## $ shapeUniformity : int 1 4 1 8 1 10 1 2 1 1 ...
## $ marginalAdhesion : int 1 5 1 1 3 8 1 1 1 1 ...
## $ singleEpithelialCellSize: int 2 7 2 3 2 7 2 2 2 2 ...
## $ bareNuclei : int 1 10 2 4 1 10 10 1 1 1 ...
## $ blandChromatin : int 3 3 3 3 3 9 3 3 1 2 ...
## $ normalNucleoli : int 1 2 1 7 1 7 1 1 1 1 ...
## $ mitosis : int 1 1 1 1 1 1 1 1 5 1 ...
## $ class : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
## clumpThickness sizeUniformity shapeUniformity
## 0 0 0
## marginalAdhesion singleEpithelialCellSize bareNuclei
## 0 0 16
## blandChromatin normalNucleoli mitosis
## 0 0 0
## class
## 0
We can see there are 16 missing (NA) values in the bareNuclei column, so I will remove the rows containing them.
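The cleaning step itself is not echoed in the rendered output; a minimal toy sketch of it using na.omit() (the toy data frame below is illustrative, not the assessment data):

```r
# Toy sketch of the NA-removal step (the real data frame `df` is not echoed above).
toy <- data.frame(bareNuclei = c(1, NA, 3), class = c("benign", "malignant", "benign"))
colSums(is.na(toy))        # locate columns with missing values
toy_clean <- na.omit(toy)  # drop every row containing an NA
nrow(toy_clean)            # 2 rows remain
```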
## [1] 683 10
Plotting the data, I will look for outliers.
I can see that all of the data lies within the expected range (1-10) given in the data description, so there is no need to remove any outliers. I can also see that the class column was converted correctly for all rows.
## [1] 478 9
## [1] 205 9
A brief explanation of each of the metrics:
All four metrics give very similar classification results. Canberra gives the highest accuracy, at K values of 8, 10, 11, 12, 13, and 14; Minkowski and Euclidean produce identical results; and Manhattan produces the lowest accuracy.
From my research, Minkowski is actually a generalised form of both Euclidean and Manhattan: it calculates a different distance depending on the 'p' value that is passed. If p=1 is passed, it returns the Manhattan distance; if p=2, the Euclidean distance. The documentation of the dist() function (https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/dist) shows that p=2 is the default value used when p is not specified, which explains why the outputs of Minkowski and Euclidean are identical.
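This equivalence is easy to verify directly with dist() on a single pair of points:

```r
# Two points: (0, 0) and (3, 4); Euclidean distance is 5, Manhattan distance is 7.
x <- matrix(c(0, 0, 3, 4), nrow = 2, byrow = TRUE)
as.numeric(dist(x, method = "minkowski", p = 2))  # 5, same as "euclidean"
as.numeric(dist(x, method = "minkowski", p = 1))  # 7, same as "manhattan"
as.numeric(dist(x, method = "euclidean"))         # 5
as.numeric(dist(x, method = "manhattan"))         # 7
```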
I also plotted Minkowski with p=1, to compare the output to Manhattan.
## [[1]]
## [1] "Minkowski , p= 1"
##
## [[2]]
## [1] 6
##
## [[3]]
## [1] 97.56098
As we can see, it returns the same result as the Manhattan distance when passed p=1.
It is interesting to see that all of the other distances produced the maximum accuracy value at multiple K-values, except for Euclidean and Minkowski p=2.
As for the Canberra distance, it is often used for data scattered around an origin, as it is biased towards measures around the origin and very sensitive to values close to zero. It can also handle several types of data (binary, categorical, quantitative, etc.). I would say the Canberra distance is quite well suited to this dataset: there are multiple variables to learn from, and the metric is often used to identify anomalies in datasets.
It should also be noted that the Canberra distance produced the highest accuracy at several values of K. I will go with the largest of these, since in KNN a larger K gives a smoother decision boundary (lower model complexity) and less sensitivity to noise. Therefore, I will say that the Canberra distance has produced the most accurate model, with K = 14 as the ideal number of neighbours.
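Since class::knn() only supports Euclidean distance, metrics such as Canberra require a hand-rolled neighbour vote. A minimal base-R sketch (knn_predict and the toy data below are hypothetical, not the assessment code):

```r
# Minimal KNN majority vote with an arbitrary dist() metric (hypothetical helper).
knn_predict <- function(train_x, train_y, new_x, k, method = "canberra") {
  # distances from the new point (row 1) to every training row
  d <- as.matrix(dist(rbind(new_x, train_x), method = method))[1, -1]
  votes <- train_y[order(d)[1:k]]   # labels of the k nearest neighbours
  names(which.max(table(votes)))    # majority class
}
tx <- matrix(c(1, 1, 1, 2, 8, 8, 9, 8), ncol = 2, byrow = TRUE)
ty <- c("benign", "benign", "malignant", "malignant")
knn_predict(tx, ty, c(1.5, 1.5), k = 3)  # "benign"
```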
# using caret package
fit <- train(class~ .,
method = "knn",
tuneGrid = expand.grid(k = 14),
trControl = trainControl(method = "LOOCV"),
metric = "Accuracy",
data = df)
fit
## k-Nearest Neighbors
##
## 683 samples
## 9 predictor
## 2 classes: 'benign', 'malignant'
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 682, 682, 682, 682, 682, 682, ...
## Resampling results:
##
## Accuracy Kappa
## 0.966325 0.9259104
##
## Tuning parameter 'k' was held constant at a value of 14
The LOOCV method has produced an accuracy of \(96.6325\%\) when K = 14. This is lower than the accuracy found with the Canberra distance, but still within the ballpark, especially given that the other distances all produced similar values. The Kappa value (Kappa = (observed accuracy - expected accuracy)/(1 - expected accuracy)) is also impressively high, indicating that the model's predictions are in almost perfect agreement with the true labels, beyond what chance alone would give.
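As an illustration of that Kappa formula, it can be computed directly from a confusion matrix; the counts below are hypothetical, chosen only to roughly match the accuracy reported above:

```r
# Cohen's kappa from a 2x2 confusion matrix (hypothetical counts).
cm <- matrix(c(430, 10, 13, 230), nrow = 2)  # rows: predicted, cols: actual
n  <- sum(cm)
po <- sum(diag(cm)) / n                      # observed accuracy
pe <- sum(rowSums(cm) * colSums(cm)) / n^2   # accuracy expected by chance
(po - pe) / (1 - pe)                         # kappa, roughly 0.93 here
```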
In this question, you need to implement a linear regression model to predict the residuary resistance of sailing yachts. The data set used in this question can be found in ‘yacht_hydrodynamics.csv.’ The data set has 7 features, which are summarized as below.
Variations concern hull geometry coefficients and the Froude number:
The measured variable is the residuary resistance per unit weight of displacement:
Specifically, you need to:
df2 = read.csv('yacht_hydrodynamics.csv')
str(df2)
## 'data.frame': 308 obs. of 7 variables:
## $ V1: num -2.3 -2.3 -2.3 -2.3 -2.3 -2.3 -2.3 -2.3 -2.3 -2.3 ...
## $ V2: num 0.568 0.568 0.568 0.568 0.568 0.568 0.568 0.568 0.568 0.568 ...
## $ V3: num 4.78 4.78 4.78 4.78 4.78 4.78 4.78 4.78 4.78 4.78 ...
## $ V4: num 3.99 3.99 3.99 3.99 3.99 3.99 3.99 3.99 3.99 3.99 ...
## $ V5: num 3.17 3.17 3.17 3.17 3.17 3.17 3.17 3.17 3.17 3.17 ...
## $ V6: num 0.125 0.15 0.175 0.2 0.225 0.25 0.275 0.3 0.325 0.35 ...
## $ V7: num 0.11 0.27 0.47 0.78 1.18 1.82 2.61 3.76 4.99 7.16 ...
colSums(is.na(df2))
## V1 V2 V3 V4 V5 V6 V7
## 0 0 0 0 4 0 0
There is a total of 308 rows in the dataset, and 4 rows have missing (NA) values in column V5. The values have been read in as numeric data types, which I will leave as is. I will remove the rows with missing values and plot the data in boxplots to check for significant outliers.
df2 <- na.omit(df2)
dim(df2)
## [1] 304 7
Plotting the data, I will look for anomalies.
# First check for duplicated data
sum(duplicated(df2))
## [1] 0
scatterplotMatrix(df2, regLine = list(col="blue"), smooth=list(col.smooth="orange", col.spread="orange"), col='black')
meltData <- melt(df2)
p <- ggplot(meltData, aes(factor(variable), value))
p + geom_boxplot() + facet_wrap(~variable, scale="free")
summary(df2)
## V1 V2 V3 V4
## Min. :-5.000 Min. :0.5300 Min. :4.340 Min. :2.810
## 1st Qu.:-2.400 1st Qu.:0.5460 1st Qu.:4.770 1st Qu.:3.750
## Median :-2.300 Median :0.5650 Median :4.780 Median :3.960
## Mean :-2.381 Mean :0.5641 Mean :4.788 Mean :3.937
## 3rd Qu.:-2.300 3rd Qu.:0.5740 3rd Qu.:5.100 3rd Qu.:4.170
## Max. : 0.000 Max. :0.6000 Max. :5.140 Max. :5.350
## V5 V6 V7
## Min. :2.730 Min. :0.1250 Min. : 0.0100
## 1st Qu.:3.150 1st Qu.:0.2000 1st Qu.: 0.7675
## Median :3.150 Median :0.2875 Median : 3.0100
## Mean :3.206 Mean :0.2873 Mean :10.5261
## 3rd Qu.:3.510 3rd Qu.:0.3750 3rd Qu.:12.8150
## Max. :3.640 Max. :0.4500 Max. :62.4200
There is no duplicated data, and no values appear to be significant outliers. Only the target column (V7) has some extreme values above 30, but these account for 44 rows (approx. 15% of the data), which I think is significant enough to keep.
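The eyeballing of the boxplots can also be backed up programmatically with the standard 1.5 x IQR whisker rule; a sketch (iqr_outliers is a hypothetical helper, not part of the assessment code):

```r
# Flag values beyond 1.5 * IQR from the quartiles (standard boxplot whisker rule).
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
}
sum(iqr_outliers(c(1:10, 50)))  # 1: only the value 50 is flagged
```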
Next, I will look at standardising the values in preparation for linear regression. This will be particularly important when we get to applying Lasso and Ridge regularisation.
## V1 V2 V3 V4
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.5200 1st Qu.:0.2286 1st Qu.:0.5375 1st Qu.:0.3701
## Median :0.5400 Median :0.5000 Median :0.5500 Median :0.4528
## Mean :0.5238 Mean :0.4866 Mean :0.5596 Mean :0.4436
## 3rd Qu.:0.5400 3rd Qu.:0.6286 3rd Qu.:0.9500 3rd Qu.:0.5354
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## V5 V6 V7
## Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.4615 1st Qu.:0.2308 1st Qu.:0.01214
## Median :0.4615 Median :0.5000 Median :0.04807
## Mean :0.5228 Mean :0.4995 Mean :0.16850
## 3rd Qu.:0.8571 3rd Qu.:0.7692 3rd Qu.:0.20518
## Max. :1.0000 Max. :1.0000 Max. :1.00000
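The 0-1 summaries above are consistent with min-max scaling. The actual chunk is not echoed, so the helper name minmax below is an assumption; a minimal sketch:

```r
# Min-max scaling maps each column onto [0, 1] (hypothetical helper name).
minmax <- function(x) (x - min(x)) / (max(x) - min(x))
minmax(c(2, 4, 6))  # 0.0 0.5 1.0
# applied column-wise, e.g.: df2scaled <- as.data.frame(lapply(df2, minmax))
```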
Looking further at the data, V7's distribution appears exponential in shape (heavily right-skewed), and V6 is very obviously not normally distributed. Given this, I don't think the relationship between the variables would be linear in nature.
There are also some apparent correlations between the variables:
## [1] 243 6
## [1] 61 6
Define an auxiliary function that calculates the prediction and cost based on the projected data and estimated coefficients.
# auxiliary function to calculate labels based on the estimated coefficients
predict_func <- function(Phi, w){
# y(x,w) = Phi(x) * w
return(Phi%*%w)
}
# The cost function for linear regression is the sum of squared error
error_func <- function (Phi, w, target){
return(0.5 * sum((predict_func(Phi, w) - target)^2))
}
# RMSE function to calculate error
rmse_func <- function (Phi, w, target){
return(sqrt(mean((predict_func(Phi, w) - target)^2)))
}
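A quick sanity check of these helpers on a tiny system where the fit is exact (the helpers are repeated below so the chunk runs standalone):

```r
# Redefine the helpers so this check is self-contained.
predict_func <- function(Phi, w) Phi %*% w
error_func   <- function(Phi, w, target) 0.5 * sum((predict_func(Phi, w) - target)^2)
rmse_func    <- function(Phi, w, target) sqrt(mean((predict_func(Phi, w) - target)^2))
Phi <- cbind(1, c(0, 1, 2))  # bias column plus one feature
w   <- c(1, 2)               # predictions are 1, 3, 5
t   <- c(1, 3, 5)            # targets equal predictions, so both errors are 0
error_func(Phi, w, t)  # 0
rmse_func(Phi, w, t)   # 0
```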
Notes:
The loop iterates until \(\tau\) reaches tau.max OR the error (sum of squares) falls below epsilon.
For the first iteration, tau = 1 and W[tau,] holds the weights randomly assigned with runif in the initialisation step.
In each iteration, we shuffle the training data and hence Phi and T.
Each set of values for \(\pmb{w}^\tau\) is evaluated against the full set of \(\pmb{\phi}\) values hence error_func(Phi, W[tau,],T)
For each data point \(\pmb{\phi}_n\), we need to calculate \(\nabla E(\pmb{w}) = -(t_n - \pmb{w}^\tau \pmb{\phi}_n)\pmb{\phi}_n\) and then update the \(\pmb{w}^{\tau+1}\) by subtracting \(\eta * \nabla E(\pmb{w})\) from the current value of \(\pmb{w}^\tau\).
\(\pmb{w}^{\tau+1} := \pmb{w}^{\tau} - \eta \nabla E(\pmb{w}) = \pmb{w}^{\tau} - \big(-\eta(t_n - \pmb{w}^\tau \pmb{\phi}_n)\pmb{\phi}_n\big) = \pmb{w}^{\tau} + \eta(t_n - \pmb{w}^\tau \pmb{\phi}_n)\pmb{\phi}_n\) can be implemented as:
\(W[(tau+1),j] := W[tau,j] + \eta \times (T[i]-t\_pred) \times Phi[i,j]\), where \(t\_pred\) is `Phi[i,] %*% W[tau,]`.
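The update rule above can be sketched end-to-end on toy data; this is a simplified stand-in with hypothetical variable names, not the assessment code itself:

```r
# SGD for linear least squares on toy data: learn t = 1 + 2x.
set.seed(1)
n   <- 200
x   <- runif(n)
t   <- 1 + 2 * x + rnorm(n, sd = 0.05)
Phi <- cbind(1, x)   # design matrix with a bias column
w   <- runif(2)      # random initial weights
eta <- 0.05
for (tau in 1:5000) {
  i      <- sample(n, 1)                          # one random data point per step
  t_pred <- sum(Phi[i, ] * w)
  w      <- w + eta * (t[i] - t_pred) * Phi[i, ]  # w := w - eta * grad E
}
round(w, 1)  # close to c(1, 2)
```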
## [1] "tau: 1000 error: 0.171746126754658"
## [1] "tau: 2000 error: 0.147349879977765"
## [1] "tau: 3000 error: 0.14337344532503"
## [1] "tau: 4000 error: 0.144140519480111"
## [1] "The final coefficents are: "
## X0 X1 X2 X3 X4 X5
## -0.19062143 0.04058752 -0.01763769 -0.05003362 0.06263530 0.06526828
## X6
## 0.63503127
## [1] "The final training error is: 0.142716459101373"
## [1] "The final testing error is: 0.145290951967881"
# prepare error dataframe for visualization:
error.m <- melt(a$error, id='tau')
names(error.m)<-c('tau','dataset','RMSE')
# plot it
ggplot(data=error.m, aes(x=tau, y=RMSE, color=dataset)) +
geom_line() +
ggtitle('Testing and Training RMSE Trends') +
theme_minimal()
# convert W matrix to a dataframe and melt for plotting
W.df <- as.data.frame(a$W); names(W.df) <- c('w0','w1','w2','w3','w4', 'w5', 'w6')
W.df$tau <- 1:nrow(W.df)
W.m <-melt(W.df, id='tau');
names(W.m) <- c('tau', 'coefficients', 'values')
ggplot(data=W.m, aes(x=tau, y=values, color=coefficients)) + geom_line() + ggtitle('Estimated Coefficients') + theme_minimal()
I have experimented with a number of values for eta, tau.max, and the initial W. From what I have seen, the lowest RMSE values are produced when \(\eta = 0.01\). I also tried running the model for more iterations, but both the coefficients and the error flatten out after about the 3000 mark; I extended the run to 5000 to confirm that the cost converges.
Looking at the training vs. testing error graph, it is clear that the model performs very similarly on both sets; the final error values are nearly identical. There is a sharp drop around the 200th iteration, where the model starts to perform better and the error becomes quite small, but from that point through to the 5000th iteration the further improvement is only about 0.03. The test and train error rates also do not diverge, so I believe the chosen learning rate is probably the best for this dataset.
Looking at the coefficient estimates over time, we can see that they start from random values and all plateau around the 3000th iteration, although small jumps remain in the estimates. This is to be expected given the stochastic nature of SGD; overall, I think the model has learned well.
The initial value of W was another variable I experimented with. I tried \(W=0\), \(W=0.25\), \(W=0.5\), \(W=0.75\), \(W=1\), and \(W=random\), and found little difference in performance after 3000 iterations across all the models. Even though the estimates jumped around quite a bit at the beginning, they all converged by around the 2000th iteration and flattened out further after 3000. This is probably due to the scaled and normalised nature of the data: the model does not have far to move and can find the optimal values quickly.
eta = 0.01, tau.max = 5000, W = random
eta = 0.01, tau.max = 5000, W = 0
eta = 0.01, tau.max = 5000, W = 0.25
eta = 0.01, tau.max = 5000, W = 0.5
eta = 0.01, tau.max = 5000, W = 0.75
eta = 0.01, tau.max = 5000, W = 1
Also, the final test RMSE is \(\approx 0.1453\). I consider this to be fairly low given the data is scaled from 0 to 1, so the model appears quite decent at predicting the values.
print(paste('The final RMSE error is:', a$f_test_error))
## [1] "The final RMSE error is: 0.145290951967881"
# collect the final coefficients into a single vector
coeffs <- matrix()
for (i in 1:7) {
coeffs <- cbind(coeffs, a$f_coeffs[[i]])
}
# drop the initial NA placeholder created by matrix()
coeffs <- coeffs[-1]
coeffs
## [1] -0.19062143 0.04058752 -0.01763769 -0.05003362 0.06263530 0.06526828
## [7] 0.63503127
# build a data frame of actual vs. predicted values and residuals for a dataset
evalModel <- function(learn_data, target_data, coeffs) {
model_resids <- setNames(data.frame(matrix(ncol = 3)), c("Actual", "Predicted", "Residual"))
size <- length(target_data)
for (i in 1:size) {
# prepend the bias term, then compute the prediction as a weighted sum
row <- as.matrix(cbind('v0'=1, learn_data[i,]))
yhat <- sum(data.frame(mapply(`*`, row, coeffs)))
y <- target_data[i]
resid <- y - yhat
model_resids <- rbind(model_resids, c(y, yhat, resid))
}
# drop the NA placeholder row created on initialisation
model_resids <- na.omit(model_resids)
return(model_resids)
}
eval <- evalModel(test.data2, test.target2, matrix(coeffs))
eval
rss <- sum((eval$Predicted - eval$Actual) ^ 2)
tss <- sum((eval$Actual - mean(eval$Actual)) ^ 2)
rsq <- 1 - rss/tss
rsq
## [1] 0.6705301
print(paste("The model has a", round(rsq*100, digits=2), "% accuracy on the test data"))
## [1] "The model has a 67.05 % accuracy on the test data"
ggplot(eval, aes(x = Predicted, y = Residual)) + # residuals vs. predicted values
geom_point() +
geom_line(aes(y = Predicted))
There is clearly a pattern (perhaps cubic?) in the residuals, which leads me to think that this dataset is not suited to linear regression. This is further supported by the R-squared of the model on the test data (\(\approx 67\%\)), which indicates that the model isn't a very good fit for the data. This is not too surprising, since I could already see signs during the initial data exploration that this dataset would not suit a linear model, such as:
print(paste('The final training RMSE is:', round(a$f_train_error, digits=4)))
## [1] "The final training RMSE is: 0.1427"
print(paste('The final testing RMSE is:', round(a$f_test_error, digits=4)))
## [1] "The final testing RMSE is: 0.1453"
chart.Correlation(df2scaled, histogram=TRUE, pch=19)
Considering that the RMSEs of the two datasets are so close, I think the model does a good job of neither overfitting nor underfitting overall. However, the training error is slightly lower than the testing error, and this was the case in every run, suggesting a very small degree of overfitting. This could be due to the small size of the dataset.
Generally, a test RMSE substantially greater than the train RMSE indicates overfitting of the data, while high RMSE on both the train and test sets indicates underfitting.
Looking back at the correlation graph, it is very clear that only V6 has any significant correlation with the target V7, and the shapes of the data in these two columns make this even clearer. From Wikipedia, "in naval architecture the Froude number is a significant figure used to determine the resistance of a partially submerged object moving through water", so it makes sense that it would be a significant input in determining the resistance, but I am surprised that none of the other variables appear to have any significant influence.
The Froude number is a dimensionless quantity used to indicate the influence of gravity on fluid motion. From my research (https://www.mermaid-consultants.com/ship-resistance-calculation.html), it is said that Ship resistance is normally a combination of four resistances:
Also, https://www.sciencedirect.com/topics/engineering/residuary-resistance states that it is usual to group wave-making resistance, form resistance, eddy resistance, and frictional resistance into one force termed residuary resistance (i.e. V7).
Learning this, it actually makes sense that none of the other elements contribute significantly to the residuary resistance. They may contribute to the other resistance elements that then contribute to the residuary resistance.
To see which variables will be the most significant for V7, I will do a backward selection on all the variables to find which variables produce the model with the lowest AIC.
model <- lm(V7 ~ ., data=df2scaled)
smodel <- step(model)
## Start: AIC=-1170.23
## V7 ~ V1 + V2 + V3 + V4 + V5 + V6
##
## Df Sum of Sq RSS AIC
## - V2 1 0.0002 6.1816 -1172.22
## - V3 1 0.0017 6.1831 -1172.14
## - V5 1 0.0019 6.1833 -1172.13
## - V4 1 0.0020 6.1834 -1172.13
## - V1 1 0.0078 6.1892 -1171.84
## <none> 6.1814 -1170.23
## - V6 1 11.8244 18.0058 -847.21
##
## Step: AIC=-1172.22
## V7 ~ V1 + V3 + V4 + V5 + V6
##
## Df Sum of Sq RSS AIC
## - V1 1 0.0079 6.1895 -1173.8
## - V3 1 0.0104 6.1920 -1173.7
## - V5 1 0.0108 6.1924 -1173.7
## - V4 1 0.0133 6.1949 -1173.6
## <none> 6.1816 -1172.2
## - V6 1 11.8243 18.0059 -849.2
##
## Step: AIC=-1173.83
## V7 ~ V3 + V4 + V5 + V6
##
## Df Sum of Sq RSS AIC
## - V3 1 0.0105 6.2001 -1175.31
## - V5 1 0.0109 6.2005 -1175.29
## - V4 1 0.0134 6.2030 -1175.17
## <none> 6.1895 -1173.83
## - V6 1 11.8294 18.0189 -850.98
##
## Step: AIC=-1175.31
## V7 ~ V4 + V5 + V6
##
## Df Sum of Sq RSS AIC
## - V5 1 0.0005 6.2005 -1177.29
## - V4 1 0.0031 6.2032 -1177.16
## <none> 6.2001 -1175.31
## - V6 1 11.8255 18.0256 -852.87
##
## Step: AIC=-1177.29
## V7 ~ V4 + V6
##
## Df Sum of Sq RSS AIC
## - V4 1 0.0026 6.2032 -1179.16
## <none> 6.2005 -1177.29
## - V6 1 11.8251 18.0256 -854.87
##
## Step: AIC=-1179.16
## V7 ~ V6
##
## Df Sum of Sq RSS AIC
## <none> 6.2032 -1179.16
## - V6 1 11.825 18.0277 -856.84
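As an aside, the AIC values printed by step() come from extractAIC(), which for lm models omits an additive constant relative to AIC(); since the constant is the same for every candidate model, the rankings agree. A quick check on a built-in dataset illustrates this:

```r
# step() reports extractAIC(), which differs from AIC() by an additive constant.
fit <- lm(dist ~ speed, data = cars)  # built-in dataset as a stand-in
extractAIC(fit)                       # c(edf, AIC up to a constant), as used by step()
AIC(fit)                              # full AIC; model rankings are unchanged
```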
summary(model)
##
## Call:
## lm(formula = V7 ~ ., data = df2scaled)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.18850 -0.12175 -0.03011 0.09891 0.50595
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.119401 0.090762 -1.316 0.189
## V1 0.016848 0.027472 0.613 0.540
## V2 -0.005155 0.049956 -0.103 0.918
## V3 0.052837 0.183049 0.289 0.773
## V4 -0.070592 0.226498 -0.312 0.756
## V5 -0.063424 0.208857 -0.304 0.762
## V6 0.633621 0.026583 23.836 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1443 on 297 degrees of freedom
## Multiple R-squared: 0.6571, Adjusted R-squared: 0.6502
## F-statistic: 94.87 on 6 and 297 DF, p-value: < 2.2e-16
anova(model)
It appears that the model producing the lowest AIC, \(AIC=-1179.16\), is V7 ~ V6, so the most significant variable in determining the residuary resistance per unit weight of displacement is the Froude number. However, this is only for a linear model of degree 1.
It is possible that a higher-order model would be a better fit for the data points.
fmla1<- (as.formula(paste('V7 ~',paste('poly(',colnames(df2scaled[-7]),',1,raw=TRUE)',collapse = ' + '))))
deg1 <-lm(fmla1,data=df2scaled)
interaction <- aov(deg1, data = df2scaled)
print(paste("------------------------------DEGREE 1 SUMMARY------------------------------"))
## [1] "------------------------------DEGREE 1 SUMMARY------------------------------"
summary(deg1)
##
## Call:
## lm(formula = fmla1, data = df2scaled)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.18850 -0.12175 -0.03011 0.09891 0.50595
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.119401 0.090762 -1.316 0.189
## poly(V1, 1, raw = TRUE) 0.016848 0.027472 0.613 0.540
## poly(V2, 1, raw = TRUE) -0.005155 0.049956 -0.103 0.918
## poly(V3, 1, raw = TRUE) 0.052837 0.183049 0.289 0.773
## poly(V4, 1, raw = TRUE) -0.070592 0.226498 -0.312 0.756
## poly(V5, 1, raw = TRUE) -0.063424 0.208857 -0.304 0.762
## poly(V6, 1, raw = TRUE) 0.633621 0.026583 23.836 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1443 on 297 degrees of freedom
## Multiple R-squared: 0.6571, Adjusted R-squared: 0.6502
## F-statistic: 94.87 on 6 and 297 DF, p-value: < 2.2e-16
print(paste("------------------------------ANOVA 1 SUMMARY------------------------------"))
## [1] "------------------------------ANOVA 1 SUMMARY------------------------------"
summary(interaction)
## Df Sum Sq Mean Sq F value Pr(>F)
## poly(V1, 1, raw = TRUE) 1 0.013 0.013 0.630 0.428
## poly(V2, 1, raw = TRUE) 1 0.008 0.008 0.363 0.547
## poly(V3, 1, raw = TRUE) 1 0.000 0.000 0.002 0.966
## poly(V4, 1, raw = TRUE) 1 0.000 0.000 0.013 0.910
## poly(V5, 1, raw = TRUE) 1 0.001 0.001 0.049 0.826
## poly(V6, 1, raw = TRUE) 1 11.824 11.824 568.134 <2e-16 ***
## Residuals 297 6.181 0.021
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cat("\n\n")
fmla2<- (as.formula(paste('V7 ~',paste('poly(',colnames(df2scaled[-7]),',2,raw=TRUE)',collapse = ' + '))))
deg2 <-lm(fmla2,data=df2scaled)
interaction <- aov(deg2, data = df2scaled)
print(paste("------------------------------DEGREE 2 SUMMARY------------------------------"))
## [1] "------------------------------DEGREE 2 SUMMARY------------------------------"
summary(deg2)
##
## Call:
## lm(formula = fmla2, data = df2scaled)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.116579 -0.060889 0.002038 0.048669 0.272391
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.14652 0.14891 0.984 0.326
## poly(V1, 2, raw = TRUE)1 -0.03227 0.04318 -0.747 0.455
## poly(V1, 2, raw = TRUE)2 0.04680 0.04182 1.119 0.264
## poly(V2, 2, raw = TRUE)1 -0.02234 0.09143 -0.244 0.807
## poly(V2, 2, raw = TRUE)2 0.02351 0.03691 0.637 0.525
## poly(V3, 2, raw = TRUE)1 0.09769 0.33013 0.296 0.767
## poly(V3, 2, raw = TRUE)2 -0.01094 0.05441 -0.201 0.841
## poly(V4, 2, raw = TRUE)1 -0.17161 0.48123 -0.357 0.722
## poly(V4, 2, raw = TRUE)2 0.06233 0.14156 0.440 0.660
## poly(V5, 2, raw = TRUE)1 -0.11409 0.36243 -0.315 0.753
## poly(V5, 2, raw = TRUE)2 0.01287 0.06443 0.200 0.842
## poly(V6, 2, raw = TRUE)1 -0.84523 0.04662 -18.132 <2e-16 ***
## poly(V6, 2, raw = TRUE)2 1.48002 0.04498 32.903 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06702 on 291 degrees of freedom
## Multiple R-squared: 0.9275, Adjusted R-squared: 0.9245
## F-statistic: 310.2 on 12 and 291 DF, p-value: < 2.2e-16
print(paste("------------------------------ANOVA 2 SUMMARY------------------------------"))
## [1] "------------------------------ANOVA 2 SUMMARY------------------------------"
summary(interaction)
## Df Sum Sq Mean Sq F value Pr(>F)
## poly(V1, 2, raw = TRUE) 2 0.018 0.009 2.039 0.132
## poly(V2, 2, raw = TRUE) 2 0.010 0.005 1.094 0.336
## poly(V3, 2, raw = TRUE) 2 0.000 0.000 0.028 0.973
## poly(V4, 2, raw = TRUE) 2 0.001 0.000 0.056 0.945
## poly(V5, 2, raw = TRUE) 2 0.000 0.000 0.032 0.969
## poly(V6, 2, raw = TRUE) 2 16.691 8.346 1857.941 <2e-16 ***
## Residuals 291 1.307 0.004
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cat("\n\n")
fmla3 <- (as.formula(paste('V7 ~',paste('poly(',colnames(df2scaled[-7]),',3,raw=TRUE)',collapse = ' + '))))
deg3 <-lm(fmla3,data=df2scaled)
interaction <- aov(deg3, data = df2scaled)
print(paste("------------------------------DEGREE 3 SUMMARY------------------------------"))
## [1] "------------------------------DEGREE 3 SUMMARY------------------------------"
summary(deg3)
##
## Call:
## lm(formula = fmla3, data = df2scaled)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.084031 -0.019138 -0.002433 0.016782 0.171410
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.13444 0.20222 0.665 0.507
## poly(V1, 3, raw = TRUE)1 3.29372 4.29317 0.767 0.444
## poly(V1, 3, raw = TRUE)2 -9.40798 12.21724 -0.770 0.442
## poly(V1, 3, raw = TRUE)3 6.12993 7.92416 0.774 0.440
## poly(V2, 3, raw = TRUE)1 0.19990 0.44713 0.447 0.655
## poly(V2, 3, raw = TRUE)2 -0.60916 1.20945 -0.504 0.615
## poly(V2, 3, raw = TRUE)3 0.41382 0.79384 0.521 0.603
## poly(V3, 3, raw = TRUE)1 -0.07407 0.66427 -0.112 0.911
## poly(V3, 3, raw = TRUE)2 1.05106 1.65646 0.635 0.526
## poly(V3, 3, raw = TRUE)3 -0.77163 1.10698 -0.697 0.486
## poly(V4, 3, raw = TRUE)1 -0.75497 0.99425 -0.759 0.448
## poly(V4, 3, raw = TRUE)2 1.20968 1.53862 0.786 0.432
## poly(V4, 3, raw = TRUE)3 -0.75041 0.95772 -0.784 0.434
## poly(V5, 3, raw = TRUE)1 -0.42678 0.84076 -0.508 0.612
## poly(V5, 3, raw = TRUE)2 0.33463 1.29710 0.258 0.797
## poly(V5, 3, raw = TRUE)3 -0.15387 0.80055 -0.192 0.848
## poly(V6, 3, raw = TRUE)1 0.64202 0.04863 13.202 <2e-16 ***
## poly(V6, 3, raw = TRUE)2 -2.37788 0.11558 -20.574 <2e-16 ***
## poly(V6, 3, raw = TRUE)3 2.57034 0.07581 33.904 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03018 on 285 degrees of freedom
## Multiple R-squared: 0.9856, Adjusted R-squared: 0.9847
## F-statistic: 1084 on 18 and 285 DF, p-value: < 2.2e-16
print(paste("------------------------------ANOVA 3 SUMMARY------------------------------"))
## [1] "------------------------------ANOVA 3 SUMMARY------------------------------"
summary(interaction)
## Df Sum Sq Mean Sq F value Pr(>F)
## poly(V1, 3, raw = TRUE) 3 0.019 0.006 6.873 0.000175 ***
## poly(V2, 3, raw = TRUE) 3 0.010 0.003 3.511 0.015742 *
## poly(V3, 3, raw = TRUE) 3 0.000 0.000 0.063 0.979446
## poly(V4, 3, raw = TRUE) 3 0.001 0.000 0.390 0.760197
## poly(V5, 3, raw = TRUE) 3 0.001 0.000 0.406 0.748735
## poly(V6, 3, raw = TRUE) 3 17.738 5.913 6493.082 < 2e-16 ***
## Residuals 285 0.260 0.001
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cat("\n\n")
fmla4 <- (as.formula(paste('V7 ~',paste('poly(',colnames(df2scaled[-7]),',4,raw=TRUE)',collapse = ' + '))))
deg4 <-lm(fmla4,data=df2scaled)
interaction <- aov(deg4, data = df2scaled)
print(paste("------------------------------DEGREE 4 SUMMARY------------------------------"))
## [1] "------------------------------DEGREE 4 SUMMARY------------------------------"
summary(deg4)
##
## Call:
## lm(formula = fmla4, data = df2scaled)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.108116 -0.012216 0.000432 0.009588 0.144277
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.20373 1.31279 0.155 0.8768
## poly(V1, 4, raw = TRUE)1 -64.51317 922.16473 -0.070 0.9443
## poly(V1, 4, raw = TRUE)2 320.83029 4410.68173 0.073 0.9421
## poly(V1, 4, raw = TRUE)3 -509.66455 6785.77009 -0.075 0.9402
## poly(V1, 4, raw = TRUE)4 253.36395 3297.25655 0.077 0.9388
## poly(V2, 4, raw = TRUE)1 1.03372 10.41229 0.099 0.9210
## poly(V2, 4, raw = TRUE)2 -3.70244 42.83247 -0.086 0.9312
## poly(V2, 4, raw = TRUE)3 3.96886 59.83807 0.066 0.9472
## poly(V2, 4, raw = TRUE)4 -1.28301 26.94680 -0.048 0.9621
## poly(V3, 4, raw = TRUE)1 -0.41672 1.36238 -0.306 0.7599
## poly(V3, 4, raw = TRUE)2 3.38742 7.39639 0.458 0.6473
## poly(V3, 4, raw = TRUE)3 -4.61583 9.23514 -0.500 0.6176
## poly(V3, 4, raw = TRUE)4 1.94572 5.67872 0.343 0.7321
## poly(V4, 4, raw = TRUE)1 -1.37201 4.70745 -0.291 0.7709
## poly(V4, 4, raw = TRUE)2 3.55233 3.97711 0.893 0.3725
## poly(V4, 4, raw = TRUE)3 -4.27372 4.16857 -1.025 0.3061
## poly(V4, 4, raw = TRUE)4 1.68621 2.30955 0.730 0.4659
## poly(V5, 4, raw = TRUE)1 -0.49561 3.20825 -0.154 0.8773
## poly(V5, 4, raw = TRUE)2 0.18513 0.60240 0.307 0.7588
## poly(V5, 4, raw = TRUE)3 NA NA NA NA
## poly(V5, 4, raw = TRUE)4 NA NA NA NA
## poly(V6, 4, raw = TRUE)1 -0.12669 0.07097 -1.785 0.0753 .
## poly(V6, 4, raw = TRUE)2 1.39366 0.30599 4.555 7.83e-06 ***
## poly(V6, 4, raw = TRUE)3 -3.43925 0.46881 -7.336 2.36e-12 ***
## poly(V6, 4, raw = TRUE)4 3.00672 0.23256 12.929 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02398 on 281 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1412 on 22 and 281 DF, p-value: < 2.2e-16
print(paste("------------------------------ANOVA 4 SUMMARY------------------------------"))
## [1] "------------------------------ANOVA 4 SUMMARY------------------------------"
summary(interaction)
## Df Sum Sq Mean Sq F value Pr(>F)
## poly(V1, 4, raw = TRUE) 4 0.019 0.005 8.212 2.82e-06 ***
## poly(V2, 4, raw = TRUE) 4 0.010 0.002 4.162 0.00273 **
## poly(V3, 4, raw = TRUE) 4 0.002 0.000 0.714 0.58311
## poly(V4, 4, raw = TRUE) 4 0.004 0.001 1.552 0.18746
## poly(V5, 4, raw = TRUE) 2 0.002 0.001 1.915 0.14931
## poly(V6, 4, raw = TRUE) 4 17.830 4.458 7752.393 < 2e-16 ***
## Residuals 281 0.162 0.001
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cat("\n\n")
fmla5 <- (as.formula(paste('V7 ~',paste('poly(',colnames(df2scaled[-7]),',5,raw=TRUE)',collapse = ' + '))))
deg5 <-lm(fmla5,data=df2scaled)
interaction <- aov(deg5, data = df2scaled)
print(paste("------------------------------DEGREE 5 SUMMARY------------------------------"))
## [1] "------------------------------DEGREE 5 SUMMARY------------------------------"
summary(deg5)
##
## Call:
## lm(formula = fmla5, data = df2scaled)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.108140 -0.012221 0.000456 0.009573 0.144253
##
## Coefficients: (7 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.983e-01 9.268e-01 -0.214 0.8308
## poly(V1, 5, raw = TRUE)1 6.535e+04 2.798e+05 0.234 0.8155
## poly(V1, 5, raw = TRUE)2 -3.120e+05 1.336e+06 -0.234 0.8155
## poly(V1, 5, raw = TRUE)3 4.795e+05 2.053e+06 0.234 0.8155
## poly(V1, 5, raw = TRUE)4 -2.328e+05 9.967e+05 -0.234 0.8155
## poly(V1, 5, raw = TRUE)5 NA NA NA NA
## poly(V2, 5, raw = TRUE)1 -1.221e+03 5.225e+03 -0.234 0.8154
## poly(V2, 5, raw = TRUE)2 7.565e+03 3.236e+04 0.234 0.8153
## poly(V2, 5, raw = TRUE)3 -1.713e+04 7.326e+04 -0.234 0.8153
## poly(V2, 5, raw = TRUE)4 1.676e+04 7.165e+04 0.234 0.8152
## poly(V2, 5, raw = TRUE)5 -5.972e+03 2.553e+04 -0.234 0.8152
## poly(V3, 5, raw = TRUE)1 1.453e+02 6.147e+02 0.236 0.8133
## poly(V3, 5, raw = TRUE)2 -9.212e+02 3.888e+03 -0.237 0.8129
## poly(V3, 5, raw = TRUE)3 2.082e+03 8.769e+03 0.237 0.8125
## poly(V3, 5, raw = TRUE)4 -1.985e+03 8.338e+03 -0.238 0.8120
## poly(V3, 5, raw = TRUE)5 6.780e+02 2.841e+03 0.239 0.8116
## poly(V4, 5, raw = TRUE)1 9.005e+00 3.973e+01 0.227 0.8209
## poly(V4, 5, raw = TRUE)2 -5.504e+01 2.417e+02 -0.228 0.8200
## poly(V4, 5, raw = TRUE)3 1.049e+02 4.608e+02 0.228 0.8200
## poly(V4, 5, raw = TRUE)4 -5.916e+01 2.599e+02 -0.228 0.8201
## poly(V4, 5, raw = TRUE)5 NA NA NA NA
## poly(V5, 5, raw = TRUE)1 NA NA NA NA
## poly(V5, 5, raw = TRUE)2 NA NA NA NA
## poly(V5, 5, raw = TRUE)3 NA NA NA NA
## poly(V5, 5, raw = TRUE)4 NA NA NA NA
## poly(V5, 5, raw = TRUE)5 NA NA NA NA
## poly(V6, 5, raw = TRUE)1 -1.253e-01 1.144e-01 -1.095 0.2743
## poly(V6, 5, raw = TRUE)2 1.383e+00 7.815e-01 1.769 0.0779 .
## poly(V6, 5, raw = TRUE)3 -3.409e+00 2.064e+00 -1.651 0.0998 .
## poly(V6, 5, raw = TRUE)4 2.972e+00 2.308e+00 1.287 0.1991
## poly(V6, 5, raw = TRUE)5 1.406e-02 9.192e-01 0.015 0.9878
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02402 on 280 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1346 on 23 and 280 DF, p-value: < 2.2e-16
print(paste("------------------------------ANOVA 5 SUMMARY------------------------------"))
## [1] "------------------------------ANOVA 5 SUMMARY------------------------------"
summary(interaction)
## Df Sum Sq Mean Sq F value Pr(>F)
## poly(V1, 5, raw = TRUE) 4 0.019 0.005 8.183 2.97e-06 ***
## poly(V2, 5, raw = TRUE) 5 0.010 0.002 3.325 0.0062 **
## poly(V3, 5, raw = TRUE) 5 0.002 0.000 0.806 0.5459
## poly(V4, 5, raw = TRUE) 4 0.005 0.001 2.194 0.0699 .
## poly(V6, 5, raw = TRUE) 5 17.830 3.566 6179.849 < 2e-16 ***
## Residuals 280 0.162 0.001
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cat("\n\n")
From the analysis of the above models' RSE, R-squared, MSE, sums of squares, significance of the F and t values, and F-statistics, we can see that a polynomial of degree 4 is probably best suited to our dataset.
Degree 1
Residual standard error (RSE): 0.1443
R-squared: 0.6571
F-statistic: 94.87
Sum of squares (SS): 6.181
MSE: 0.021
Starting from Degree 1 (linear), we can see that the only variable significant to V7 is V6. Although these error rates initially appear low, comparing them against the higher-degree summaries makes it clear that this model fits the data poorly.
Degree 2
Residual standard error (RSE): 0.06702
R-squared: 0.9275
F-statistic: 310.2
Sum of squares (SS): 1.307
MSE: 0.004
There is a significant jump in the F-statistic (indicating that the variables are more likely to contribute to the output rather than chance) and in R-squared, while RSE, SS and MSE have all decreased substantially. This indicates that this model fits our data much better; however, V6 is still the only significant variable.
Degree 3
Residual standard error (RSE): 0.03018
R-squared: 0.9856
F-statistic: 1084
Sum of squares (SS): 0.260
MSE: 0.001
Again, a very large jump in the F-statistic indicates that the variables become more significant as the order of the model increases. R-squared has also improved a fair bit and is very close to perfect. The RSE, SS and MSE have all decreased, indicating an even better fit than the last model. Although V6 is still the most significant variable, V1 and V2 are starting to show through as significant.
Degree 4
Residual standard error (RSE): 0.02398
R-squared: 0.991
F-statistic: 1412
Sum of squares (SS): 0.162
MSE: 0.001
The R-squared and F-statistic values have increased again, and the RSE and SS have decreased as well. The MSE has not changed, but probably only because it is already so small. The R-squared value is close to perfect; however, in the per-variable analysis we can see that V5 lacks the data or significance to support orders 3 and 4 (\(V5^3\) and \(V5^4\) appear as NA in the table). V6 is still the most significant variable, with V2 appearing more significant and V1 rising in significance too.
Degree 5
Residual standard error (RSE): 0.02402
R-squared: 0.991
F-statistic: 1346
Sum of squares (SS): 0.162
MSE: 0.001
We can see that the RSE has actually started to increase again in this model (0.02398 to 0.02402) and the F-statistic has dropped (1412 to 1346), while R-squared, SS and MSE are unchanged. Looking at the significance of the variables, many are no longer contributing to the model at all (e.g. \(V1^5\), \(V4^5\), and all of V5, which appear as NA). So although this model appears quite accurate, it performs worse than Degree 4: it drops whole terms due to singularities while gaining nothing in fit, and its error has begun to rise.
We can say that for our dataset, a model of degree 4 would be a better fit, and this can be further supported by analysing the AIC values for each of the models.
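As a self-contained illustration of why AIC favours the right amount of complexity, here is a minimal sketch on synthetic data (the variables x, y, m1 and m2 below are hypothetical stand-ins, not the deg1-deg5 fits above): the underlying relationship is quadratic, and the quadratic fit earns the lower AIC despite its extra parameters.

```r
# synthetic sanity check: AIC rewards fit but penalises extra parameters
set.seed(1)
x <- seq(-2, 2, length.out = 100)
y <- x^2 + rnorm(100, sd = 0.1)       # true relationship is quadratic
m1 <- lm(y ~ x)                       # underfit: degree 1
m2 <- lm(y ~ poly(x, 2, raw = TRUE))  # correct: degree 2
# lower AIC indicates the better trade-off between fit and complexity
AIC(m1) > AIC(m2)
```

The same comparison underlies the aictab() ranking below, which additionally applies the small-sample correction (AICc).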
model.set <- list(deg1, deg2, deg3, deg4, deg5)          # candidate models
model.names <- c("deg1", "deg2", "deg3", "deg4", "deg5") # labels for the comparison table
aictab(model.set, modnames = model.names) # aictab() (AICcmodavg package) ranks the models by AICc
The above table is ranked from best to worst. We can see that the degree-4 model has the lowest AIC value, with degree 1 performing the worst. Deg4 and Deg5 both have the highest log-likelihood, but Deg4 wins out, probably for the reasons mentioned above.
In summary, I believe my initial model heavily underfits the data, given its low R-squared value and the fact that the data are not suited to degree-1 linear regression. In my model, the only significant variable was V6 - the Froude number - with all other variables having very little influence.
I would also like to note that in my research I found a paper recommending that the prismatic coefficient (V2) be modelled as a function of the Froude number (https://www.researchgate.net/publication/262486291_A_Practical_Design_Approach_and_RANSE-based_Resistance_Prediction_for_Medium-speed_Catamarans), which is interesting because, as I found in the higher-degree models (e.g. Deg4), V2 was starting to appear more significant.
fitAndPlot arguments: alpha selects the penalty (0: Ridge, 1: LASSO); lambda gives the range for the regularization parameter.
fitAndPlot <- function(train.data, train.label, alpha, lambda){
# fit the model
fit <- glmnet(x = as.matrix(train.data), y = train.label, alpha = alpha, lambda = lambda)
# aggregate the outputs - take the beta output of fit and transpose it (t) then add df and lambda values
out <- as.data.frame(as.matrix(t(fit$beta)))
out[,c('nonzero', 'lambda')] <- c(fit$df, fit$lambda)
# reshape the outputs (for plotting)
out.m<-melt(out, id=c('lambda', 'nonzero'))
names(out.m) <- c('lambda', 'nonzero', 'feature', 'coefficient')
# plot coefficients vs lambda
g <- ggplot(data = out.m, aes(x=lambda, y=coefficient, color=factor(feature))) +
geom_line() +
ggtitle('Coefficients vs. lambda') +
theme_minimal()
print(g)
# plot number of nonzero coefficients (as a measure of model complexity) vs lambda
g <- ggplot(data = out.m, aes(x=lambda, y=nonzero)) +
geom_line() +
scale_y_continuous(breaks=seq(0,10,1)) +
scale_color_discrete(guide = guide_legend(title = NULL)) +
ggtitle('Nonzero Coefficients vs. lambda') +
theme_minimal()
print(g)
# run the predictions
train.predict <- predict(fit, newx=as.matrix(train.data))
test.predict <- predict(fit, newx=as.matrix(test.data2))
# calculate the errors
error <- data.frame('lambda' = out$lambda,
'train' = sqrt(colSums((train.predict - train.label)^2)/nrow(train.predict)),
'test' = sqrt(colSums((test.predict - test.target2)^2)/nrow(test.predict)))
print(tail(error, 1))
error.m <- melt(error, id='lambda')
names(error.m) <- c('lambda', 'set', 'RMSE')
# plot RMSE for train and test sets vs lambda
g <- ggplot(data = error.m, aes(x=lambda, y = RMSE, color = factor(set))) +
geom_line() +
ylim(0,6) +
scale_color_discrete(guide = guide_legend(title = NULL)) +
ggtitle('RMSE vs. lambda') +
theme_minimal()
print(g)
}
fitAndPlot (train.data2, train.target2, alpha=1, lambda = c(0:1000)/10000)
## lambda train test
## s1000 0 0.1421077 0.1447743
fitAndPlot (train.data2, train.target2, alpha=0, lambda = c(0:1000)/10000)
## lambda train test
## s1000 0 0.1421072 0.1447694
Running these two models was much faster than running my gradient descent model, showing that adding a penalty can lead to a much faster fit. While I was initially a little worried about the error rate because the model learned so quickly, the RMSEs of both Ridge and LASSO are almost exactly the same as my GD model's, and as each other's.
However, there is a very clear difference between the two final models. LASSO has very quickly identified that most of the variables are not important to the output, and by the end of the run we have only one non-zero coefficient: V6 (Froude number).
Ridge has kept all six variables, each at a different value.
I believe Ridge ran slightly faster than LASSO. On large real-world datasets, Ridge is often faster, but it keeps all variables, which may not be ideal.
I would like to run these three variations on larger datasets to see how they perform.
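The shrinkage Ridge applies above can be reproduced in base R with the closed-form solution \(\hat{w} = (X^TX + \lambda I)^{-1}X^Ty\). This is a minimal sketch on synthetic data, not what glmnet does internally (glmnet also standardises the features and handles the intercept separately):

```r
set.seed(42)
X <- matrix(rnorm(100 * 3), ncol = 3)            # 100 observations, 3 features
y <- X %*% c(2, -1, 0.5) + rnorm(100, sd = 0.2)  # known true coefficients plus noise
ridge <- function(X, y, lambda) {
  # closed-form ridge solution: (X'X + lambda*I)^-1 X'y
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
}
w0  <- ridge(X, y, 0)   # lambda = 0 reduces to ordinary least squares
w10 <- ridge(X, y, 10)  # a larger penalty shrinks the coefficient vector
sum(w10^2) < sum(w0^2)  # the L2 norm of the coefficients decreases with lambda
```

Note that the penalty pulls the whole coefficient vector towards zero but never sets an individual coefficient exactly to zero, which is why Ridge keeps all six variables while LASSO does not.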
In this question, you are required to implement a Logistic Regression model to classify whether a person has liver disease or not. Please read the sub-questions below carefully for detailed instructions.
# rm(list = ls(all.names = TRUE))
df3 = read.csv('ILPD.csv')
colnames(df3)
## [1] "Age" "TB" "DB" "Alkphos"
## [5] "Sgpt" "Sgot" "TP" "ALB"
## [9] "A.G.Ratio" "Liver.Patient"
Columns:
Age
Age of the patient
TB => Total Bilirubin
Bilirubin is an orange-yellow pigment that occurs normally when part of your red blood cells break down. A bilirubin test measures the amount of bilirubin in your blood. It’s used to help find the cause of health conditions like jaundice, anemia, and liver disease. Normal results for a total bilirubin test are 1.2 milligrams per deciliter (mg/dL) for adults
DB => Direct Bilirubin
Bilirubin attached by the liver to glucuronic acid, a glucose-derived acid, is called direct, or conjugated, bilirubin. Normal results for direct bilirubin are generally 0.3 mg/dL
Alkphos => Alkaline Phosphotase
Alkaline phosphatase (ALP) is an enzyme in a person’s blood that helps break down proteins. Using an ALP test, it is possible to measure how much of this enzyme is circulating in a person’s blood. The normal range for serum ALP level is 20 to 140 IU/L. High levels may indicate liver damage.
Sgpt => Alamine Aminotransferase Alamine aminotransferase (ALT) is an enzyme found primarily in the liver and kidney. ALT is increased with liver damage and is used to screen for and/or monitor liver disease. A normal range is considered to be between 7 to 56 units per liter of serum.
Sgot => Aspartate Aminotransferase
Aspartate Aminotransferase (AST) is an enzyme that is found mostly in the liver, but also in muscles. When your liver is damaged, it releases AST into your bloodstream. An AST blood test measures the amount of AST in your blood. The test can help your health care provider diagnose liver damage or disease. A normal range is considered to be between 10 to 40 units per liter of serum.
TP => Total Proteins
Albumin and globulin are two types of protein in your body. The total protein test measures the total amount albumin and globulin in your body. Normal range is between 6 and 8.3 grams per deciliter (g/dL). Low values may indicate liver disease.
ALB => Albumin
Albumin is produced only in the liver, and is the major plasma protein that circulates in the bloodstream. A low serum albumin indicates poor liver function. Normal range is 3.4 to 5.4 g/dL.
A.G.Ratio => Albumin and Globulin Ratio
The albumin/globulin ratio is the amount of albumin in the serum divided by the globulins. The ratio is used to try to identify causes of change in total serum protein. A Decrease in albumin without decrease in globulins (ie. A.G.Ratio < 1) can indicate severe liver disease. A good ratio is 1:1.
Liver.Patient
Indicates whether the patient has liver disease or not. 1 indicates disease, 2 indicates healthy.
str(df3)
## 'data.frame': 583 obs. of 10 variables:
## $ Age : int 65 62 62 58 72 46 26 29 17 55 ...
## $ TB : num 0.7 10.9 7.3 1 3.9 1.8 0.9 0.9 0.9 0.7 ...
## $ DB : num 0.1 5.5 4.1 0.4 2 0.7 0.2 0.3 0.3 0.2 ...
## $ Alkphos : int 187 699 490 182 195 208 154 202 202 290 ...
## $ Sgpt : int 16 64 60 14 27 19 16 14 22 53 ...
## $ Sgot : int 18 100 68 20 59 14 12 11 19 58 ...
## $ TP : num 6.8 7.5 7 6.8 7.3 7.6 7 6.7 7.4 6.8 ...
## $ ALB : num 3.3 3.2 3.3 3.4 2.4 4.4 3.5 3.6 4.1 3.4 ...
## $ A.G.Ratio : num 0.9 0.74 0.89 1 0.4 1.3 1 1.1 1.2 1 ...
## $ Liver.Patient: int 1 1 1 1 1 1 1 1 2 1 ...
colSums(is.na(df3))
## Age TB DB Alkphos Sgpt
## 0 0 0 0 0
## Sgot TP ALB A.G.Ratio Liver.Patient
## 0 0 0 4 0
There are a total of 583 rows in the dataset, and 4 rows in the A.G.Ratio column have null values. The values have been read in as numeric and integer data types, which I will leave as is. I will remove the rows with null values and plot the data in boxplots to identify any significant outliers.
# remove null values
df3 <- na.omit(df3)
# confirm 4 rows with null values have been removed
dim(df3)
## [1] 579 10
M <- cor(df3) # correlation matrix over all columns
corrplot(M, method="number") # corrplot() from the corrplot package
# str(df3)
# check if any errors in output column
unique(df3[,"Liver.Patient"])
## [1] 1 2
# recode the Liver.Patient label so that 1 = has disease and 2 becomes 0 = no disease
df3$Liver.Patient[df3$Liver.Patient == 2] <- 0
summary(df3)
## Age TB DB Alkphos
## Min. : 4.00 Min. : 0.400 Min. : 0.100 Min. : 63.0
## 1st Qu.:33.00 1st Qu.: 0.800 1st Qu.: 0.200 1st Qu.: 175.5
## Median :45.00 Median : 1.000 Median : 0.300 Median : 208.0
## Mean :44.78 Mean : 3.315 Mean : 1.494 Mean : 291.4
## 3rd Qu.:58.00 3rd Qu.: 2.600 3rd Qu.: 1.300 3rd Qu.: 298.0
## Max. :90.00 Max. :75.000 Max. :19.700 Max. :2110.0
## Sgpt Sgot TP ALB
## Min. : 10.00 Min. : 10.0 Min. :2.700 Min. :0.900
## 1st Qu.: 23.00 1st Qu.: 25.0 1st Qu.:5.800 1st Qu.:2.600
## Median : 35.00 Median : 42.0 Median :6.600 Median :3.100
## Mean : 81.13 Mean : 110.4 Mean :6.482 Mean :3.139
## 3rd Qu.: 61.00 3rd Qu.: 87.0 3rd Qu.:7.200 3rd Qu.:3.800
## Max. :2000.00 Max. :4929.0 Max. :9.600 Max. :5.500
## A.G.Ratio Liver.Patient
## Min. :0.3000 Min. :0.000
## 1st Qu.:0.7000 1st Qu.:0.000
## Median :0.9300 Median :1.000
## Mean :0.9471 Mean :0.715
## 3rd Qu.:1.1000 3rd Qu.:1.000
## Max. :2.8000 Max. :1.000
We can see correlations between TB and DB, which makes sense as DB contributes to TB. Sgot and Sgpt are also strongly related, as are ALB and TP, and A.G.Ratio and ALB - again this makes sense, since ALB contributes to both TP and A.G.Ratio.
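Such correlated pairs can also be pulled out of the correlation matrix programmatically. A minimal sketch with simulated stand-in columns (the TB/DB/Age data here are synthetic, not from df3):

```r
set.seed(7)
a <- rnorm(100)
d <- data.frame(TB = a, DB = a + rnorm(100, sd = 0.1), Age = rnorm(100))
M <- cor(d)
# list variable pairs whose absolute correlation exceeds 0.7
high <- which(abs(M) > 0.7 & upper.tri(M), arr.ind = TRUE)
cbind(rownames(M)[high[, 1]], colnames(M)[high[, 2]])  # here: the TB/DB pair
```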
Next, split the data into test and train sets.
## [1] 405 9
## [1] 174 9
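The chunk producing this split is not echoed above, but the 405/174 row counts are consistent with a 70/30 split of the 579 remaining rows. A minimal sketch of how such a split could be produced (a reconstruction under that assumption, not the original chunk):

```r
set.seed(42)
n <- 579                                    # rows remaining after na.omit()
train.index <- sample(1:n, floor(0.7 * n))  # random 70% of row indices for training
length(train.index)                         # 405 training rows
n - length(train.index)                     # 174 test rows
```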
The following steps are necessary to build a logistic regression:
# class labels: c0 = no disease, c1 = has disease
c0 <- 0
c1 <- 1
# auxiliary function that calculate a cost function
cost <- function (w, X, T, c0){
sig <- sigmoid(w, X)
return(sum(ifelse(T==c0, 1-sig, sig)))
}
# auxiliary function that predicts class labels
pred <- function(w, X, c0, c1){
sig <- sigmoid(w, X)
return(ifelse(sig>0.5, c1,c0))
}
# Sigmoid function (=p(C1|X))
sigmoid <- function(w, x){
return(1.0/(1.0+exp(-w%*%t(cbind(1,x)))))
}
# Accuracy function
get_accuracy <- function(w) {
# predict targets with current model
preds <- t(as.matrix(pred(w,train.data3,c0,c1)))
# create confusion matrix, converting targets and predictions to factors
tab <- table(factor(preds, levels=min(train.target3):max(train.target3)),
factor(train.target3, levels=min(train.target3):max(train.target3)))
# calculate % accuracy = (TP + TN) / (TP + TN + FP + FN)
acc <- sum(diag(tab)/(sum(rowSums(tab)))) * 100
return(acc)
}
plot_coeffs <- function(W, model) {
# convert W matrix to a dataframe and melt for plotting
W.df <- as.data.frame(W); names(W.df) <- c('w0','w1','w2','w3','w4', 'w5', 'w6', 'w7', 'w8', 'w9')
W.df$tau <- 1:nrow(W.df)
W.m <-melt(W.df, id='tau');
names(W.m) <- c('tau', 'coefficients', 'values')
ggplot(data=W.m, aes(x=tau, y=values, color=coefficients)) + geom_line() +
labs(title='Estimated Coefficients', subtitle=model) +
theme_minimal()
}
plot_costs <- function(costs, eta) {
# plot the log of the cost at each iteration
ggplot(data=costs, aes(x=tau, y=log(cost))) +
geom_line() + labs(title='Log of Cost over Time', subtitle=paste("Learning rate:", eta)) + theme_minimal()
}
plot_acc <- function(acc, eta) {
# plot the model accuracy (%) at each iteration
ggplot(data=acc, aes(x=tau, y=accuracy)) +
geom_line() + labs(title='% Accuracy over Time', subtitle=paste("Learning rate:", eta)) + theme_minimal()
}
executeLogReg <- function(data, target, tau.max, eta, epsilon, algorithm) {
# Initializations
tau <- 1 # iteration counter
terminate <- FALSE
## Just a few name/type conversion to make the rest of the code easy to follow
X <- as.matrix(data) # rename just for convenience
T <- ifelse(target==c0,0,1) # rename just for convenience
train_len <- nrow(X) # number of training rows (needed by the SGD branch below)
W <- matrix(,nrow=tau.max, ncol=(ncol(X)+1)) # to be used to store the estimated coefficients
set.seed(ncol(W))
W[1,] <- runif(ncol(W)) # initial weight (any better idea?)
# W[1,] <- 0
# project the data through the sigmoid function (just for convenience)
Y <- sigmoid(W[1,],X)
costs <- data.frame('tau'=1:tau.max) # to be used to trace the cost in each iteration
costs[1, 'cost'] <- cost(W[1,],X,T, c0)
model_acc <- data.frame('tau'=1:tau.max) # to be used to trace the accuracy in each iteration
model_acc[1, 'accuracy'] <- get_accuracy(W[1,])
while(!terminate){
# check termination criteria:
if (tau >= tau.max | cost(W[tau,],X,T, c0) <= epsilon) {terminate <- TRUE; break}
if (algorithm == "bgd") {
# Get gradient
Y <- sigmoid(W[tau,],X)
# Update the weights
W[(tau+1),] <- W[tau,] - eta * 1/nrow(X) * (Y-T) %*% cbind(1, X)
# record the cost:
costs[(tau+1), 'cost'] <- cost(W[tau,],X,T, c0)
# record the accuracy
model_acc[(tau+1), 'accuracy'] <- get_accuracy(W[tau,])
# update the counter:
tau <- tau + 1
# decrease learning rate:
eta = eta * 0.999
}
if (algorithm == "sgd") {
# shuffle data:
train.index <- sample(1:train_len, train_len, replace = FALSE)
X <- X[train.index,]
T <- T[train.index]
# for each datapoint:
for (i in 1:train_len) {
# check termination criteria:
if (tau >= tau.max | cost(W[tau,],X,T, c0) <=epsilon) {terminate<-TRUE;break}
Y <- sigmoid(W[tau,],X)
# Update the weights
W[(tau+1),] <- W[tau,] - eta * (Y[i]-T[i]) * cbind(1, t(X[i,]))
# record the cost:
costs[(tau+1), 'cost'] <- cost(W[tau,],X,T, c0)
# record the accuracy
model_acc[(tau+1), 'accuracy'] <- get_accuracy(W[tau,])
# update the counter:
tau <- tau + 1
# decrease learning rate:
eta = eta * 0.999
}
}
}
costs <- costs[1:tau, ] # remove the NaN tail of the vector (in case of early stopping)
model_acc <- model_acc[1:tau, ]
W <- W[1:tau,]
f_coeffs <- W[tau,]
f_acc <- tail(model_acc, 1)
result <- list("costs" = costs, "acc" = model_acc, "W" = W, "f_coeffs" = f_coeffs, "f_acc" = f_acc)
return(result)
}
bgd.1 <- executeLogReg(train.data3, train.target3, 5000, 0.1, 0.01, "bgd")
print(paste("The final accuracy is:",round(bgd.1$f_acc[2], digits=2),"%"))
## [1] "The final accuracy is: 69.88 %"
cat("The final coefficients are:", bgd.1$f_coeffs)
## The final coefficients are: -1.685641 1.15582 13.25407 8.140503 0.1855517 0.7293219 -0.1552714 -13.89243 -8.953661 -2.72975
plot_costs(bgd.1$costs, 0.1)
The cost of the model drops quite quickly at the beginning, but then smooths out around the 2000th iteration, after which there is no significant improvement in the cost. The model could probably stop after the 2500th iteration.
plot_coeffs(bgd.1$W, "BGD, Learning rate: 0.1")
The model has spent a significant amount of time estimating the coefficients, and it appears to still be modifying them at the 5000th iteration. The values jump around a fair bit at the beginning, due to the random initial values assigned to the coefficients.
Looking at the final accuracy on the training data, the model appears to have learned the data fairly quickly, and its performance is average.
plot_acc(bgd.1$acc, 0.1)
The accuracy graph is very interesting because it jumps around so much, and its behaviour is very consistent with the cost graph, as can be seen below:
grid.arrange(
plot_acc(bgd.1$acc, 0.1),
plot_costs(bgd.1$costs, 0.1)
)
It is interesting to see how the accuracy starts off so low, most likely due to the random initialisation of W, and then continues to jump around while the model learns. It is also notable that the lower the accuracy, the lower the cost.
Even though the model is still changing the coefficient values at the 5000th iteration, there are no further cost or accuracy improvements, which is why I stopped it there.
I will look at accuracy, sensitivity and specificity.
In medical settings, sensitivity and specificity are the two most reported ratios from the confusion matrix.
Sensitivity (AKA Recall): true positive rate (true positive)/(true positive+false negative). This describes what proportion of patients with liver disease are correctly identified as having liver disease. If high, we aren’t missing many people with liver disease. If low, then we are missing these people and they won’t receive the treatment they should.
Specificity: true negative rate (true negative)/(true negative+false positive). What proportion of healthy patients are correctly identified as healthy? If high, we are marking healthy as healthy. If low, we have false positives and people will either incorrectly receive treatment or in some other way incorrectly respond to a false positive.
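These ratios can be computed directly from the cells of a 2x2 confusion matrix. A minimal base-R sketch with made-up counts (30 true positives, 10 false negatives, 40 true negatives, 10 false positives - hypothetical numbers, not the model's actual results):

```r
TP <- 30; FN <- 10; TN <- 40; FP <- 10
sensitivity <- TP / (TP + FN)  # recall: diseased patients correctly flagged
specificity <- TN / (TN + FP)  # healthy patients correctly cleared
precision   <- TP / (TP + FP)  # flagged patients who are actually diseased
f1 <- 2 * precision * sensitivity / (precision + sensitivity)
c(sensitivity = sensitivity, specificity = specificity,
  precision = precision, f1 = f1)
```

With these counts, sensitivity is 30/40 = 0.75 and specificity is 40/50 = 0.8; the evalRegModel() function below computes the same quantities from an actual confusion table.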
# auxiliary function that predicts class labels
prob <- function(sigval){
# sigval <- sig(x)
return(ifelse(sigval>0.5, 1, 0))
}
# Sigmoid function (=p(C1|X))
sig <- function(x){
return(1.0/(1.0+exp(-x)))
}
evalRegModel <- function(data, target, coeffs) {
model_results <- setNames(data.frame(matrix(ncol = 3)), c("actual", "score", "predicted"))
size <- length(target)
for (i in 1:size) {
row <- as.matrix(cbind('v0'=1, data[i,]))
score <- sig(sum(data.frame(mapply(`*`, row, coeffs))))
yhat <- prob(score)
y <- target[i]
model_results <- rbind(model_results, c(y, score, yhat))
}
# print(model_results)
model_results <- na.omit(model_results)
# create confusion matrix, converting targets and predictions to factors
tab <- table(factor(model_results$predicted, levels=min(test.target3):max(test.target3)),
factor(model_results$actual, levels=min(test.target3):max(test.target3)))
# print(tab)
# calculate % accuracy = (TP + TN) / (TP + TN + FP + FN)
acc <- sum(diag(tab)/(sum(rowSums(tab)))) * 100
# sensitivity AKA recall
sen <- round(tab[1,1]/sum(tab[1,]), digits=2)
spec <- round(tab[2,2]/sum(tab[2,]), digits=2)
prec <- round(tab[1,1]/sum(tab[,1]), digits=2)
fscore <- round(2*((prec * sen)/(prec + sen)), digits=2)
result <- list("acc" = acc, "sen" = sen, "spec" = spec, "prec" = prec, "fscore" = fscore)
return(result)
}
test_acc <- evalRegModel(test.data3, test.target3, bgd.1$f_coeffs)
print(paste("The model has a", round(test_acc$acc, digits=2), "% accuracy rate on the test data"))
## [1] "The model has a 63.79 % accuracy rate on the test data"
print(paste("Sensitivity:", test_acc$sen))
## [1] "Sensitivity: 0.34"
print(paste("Specificity:", test_acc$spec))
## [1] "Specificity: 0.8"
print(paste("Precision:", test_acc$prec))
## [1] "Precision: 0.49"
print(paste("Fscore:", test_acc$fscore))
## [1] "Fscore: 0.4"
Looking at the model’s performance on the test data set, we can see that it has \(\approx 64 \%\) accuracy in correctly predicting whether a patient has liver disease or not, which is not that great.
We have a low sensitivity of \(0.34\), meaning our model is missing many patients with liver disease, so these patients would not be identified for treatment.
Our specificity is a much better value at \(0.80\), indicating that our model is much better at identifying patients who are healthy.
In a real medical setting, it would be important to have high sensitivity AND high specificity, so I would say this model performs quite poorly. This is probably due to the size of the dataset, and the fact that we used all of the available variables, whereas in reality we should use only the variables that are better at predicting disease.
bgd10 <- executeLogReg(train.data3, train.target3, 5000, 10, 0.01, "bgd")
bgd1.0 <- executeLogReg(train.data3, train.target3, 5000, 1.0, 0.01, "bgd")
bgd.01 <- executeLogReg(train.data3, train.target3, 5000, 0.01, 0.01, "bgd")
bgd.001 <- executeLogReg(train.data3, train.target3, 5000, 0.001, 0.01, "bgd")
bgd.0001 <- executeLogReg(train.data3, train.target3, 5000, 0.0001, 0.01, "bgd")
c1 <- plot_costs(bgd10$costs, 10)
c2 <- plot_costs(bgd1.0$costs, 1.0)
c3 <- plot_costs(bgd.1$costs, 0.1)
c4 <- plot_costs(bgd.01$costs, 0.01)
c5 <- plot_costs(bgd.001$costs, 0.001)
c6 <- plot_costs(bgd.0001$costs, 0.0001)
grid.arrange(
c1, c2, c3, c4, c5, c6,
ncol=2
)
a1 <- plot_acc(bgd10$acc, 10)
a2 <- plot_acc(bgd1.0$acc, 1.0)
a3 <- plot_acc(bgd.1$acc, 0.1)
a4 <- plot_acc(bgd.01$acc, 0.01)
a5 <- plot_acc(bgd.001$acc, 0.001)
a6 <- plot_acc(bgd.0001$acc, 0.0001)
grid.arrange(
a1, a2, a3, a4, a5, a6,
ncol=2
)
By changing the learning rate, we can see that when the learning rate is high, our accuracy tends to be quite high; however, the cost jumps around all over the place and the model takes longer to plateau.
\(\eta = 10\) We can see the cost starts to flatten out around the 2700th iteration, but it is still learning as we can see the line is jumping around a lot (it is still quite thick). A model with this learning rate would take many more iterations before it could confidently converge. We can also see that the accuracy of this model is still fluctuating, but it appears to be around 70%.
\(\eta = 1\) The model starts to flatten out around the 2500th iteration, but there is still some bouncing around in costs, until it gets to the 4000th iteration when it starts to smooth. The accuracy also starts to smooth at the 4500th iteration and is around the 70% mark.
\(\eta = 0.1\) This is the model that I have decided to go with in my previous answer. The cost begins to smooth much earlier than the previous rates at around the 2500th iteration and appears to be at the same sort of level of \(\approx 5.65\). The accuracy smooths out at the 2800th iteration, and is around the 70% mark
\(\eta = 0.01\) Performance is similar to 0.1, although both the cost and the accuracy smooth out slightly later.
\(\eta = 0.001\) We start to see improvements in the cost - the model now flattens out to a cost of \(5.5\) at around the 2200th iteration, however the accuracy drops down to around 65%.
\(\eta = 0.0001\) Cost is around the same as the previous model at \(5.56\), but the graph is smooth from the beginning and plateaus out very quickly. However accuracy drops again down to 64%.
We can see that changing the learning rate greatly affects the cost and accuracy, and the model needs to do a lot of work before the cost flattens out. A large learning rate tends to require more iterations but gives fairly high accuracy; a small learning rate gives a lower cost but also lower accuracy.
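The effect of the learning rate can be isolated on a toy problem: gradient descent on \(f(x) = x^2\) (gradient \(2x\)), where each step multiplies \(x\) by \(1 - 2\eta\), so the iterates shrink towards the minimum when \(|1 - 2\eta| < 1\) and diverge otherwise. A minimal sketch (unrelated to the liver-disease model):

```r
gd <- function(eta, steps = 50, x0 = 1) {
  x <- x0
  for (i in 1:steps) x <- x - eta * 2 * x  # gradient of x^2 is 2x
  x
}
abs(gd(0.1))  # small rate: converges towards the minimum at 0
abs(gd(1.1))  # too-large rate: each step overshoots and |x| grows
```

The same trade-off appears above in higher dimensions: large \(\eta\) makes the cost bounce around, small \(\eta\) gives smooth but slow progress.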
sgd.1 <- executeLogReg(train.data3, train.target3, 5000, 0.1, 0.01, "sgd")
df_compare <- setNames(data.frame(matrix(ncol = 2)), c("SGD", "BGD"))
df_compare[1,1] <- round(sgd.1$f_acc[2], digits=2)
df_compare[1,2] <- round(bgd.1$f_acc[2], digits=2)
row.names(df_compare) <- "% Accuracy"
print(df_compare)
## SGD BGD
## % Accuracy 49.66 69.88
csgd <- plot_coeffs(sgd.1$W, "SGD")
cbgd <- plot_coeffs(bgd.1$W, "BGD")
grid.arrange(
csgd, cbgd
)
costsgd <- plot_costs(sgd.1$costs, 0.1)
costbgd <- plot_costs(bgd.1$costs, 0.1)
grid.arrange(
costsgd, costbgd
)
The final accuracies of SGD and BGD differ noticeably here (49.66% vs 69.88%), and the costs in SGD are much higher and do not smooth out the way they do in BGD. The coefficient-estimation graph is also interesting: the SGD estimates jumped around far more than BGD's (at one point it estimated w4 to be over 100!) and took longer to converge, which may be due to the size of the dataset. SGD works much faster than BGD on very large datasets, so perhaps our training set of 405 rows is not large enough to show its benefits. The error of SGD is also much higher than BGD's and does not smooth out the same way BGD's does. This is expected given the random nature of SGD; the error appears to be shrinking as the iterations progress and would probably flatten out with further iterations.
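The structural difference between the two update rules can be shown in a compact, self-contained sketch on synthetic logistic-regression data (hypothetical data and step sizes, not the executeLogReg() settings above): BGD takes one whole-batch gradient step per iteration, while SGD takes one step per shuffled sample. Both reduce the cross-entropy from its starting value of \(\log 2\) at \(w = 0\).

```r
set.seed(1)
n <- 200
X <- cbind(1, rnorm(n))                    # intercept column + one feature
y <- rbinom(n, 1, plogis(X %*% c(-1, 2)))  # labels from a true logistic model
sig  <- function(w) as.vector(1 / (1 + exp(-X %*% w)))
xent <- function(w) -mean(y * log(sig(w)) + (1 - y) * log(1 - sig(w)))
w_bgd <- w_sgd <- c(0, 0)
for (i in 1:100) {                         # BGD: one full-batch gradient per step
  w_bgd <- w_bgd - 0.5 * as.vector(t(X) %*% (sig(w_bgd) - y)) / n
}
for (i in sample(1:n)) {                   # SGD: one single-sample update per point
  w_sgd <- w_sgd - 0.1 * (sig(w_sgd)[i] - y[i]) * X[i, ]
}
c(start = xent(c(0, 0)), bgd = xent(w_bgd), sgd = xent(w_sgd))
```

The per-sample updates are what make the SGD cost curve noisy: each step follows the gradient of a single point, not the average over the whole training set.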
Please note that I have two methods of training the models in kfoldcv: using glm(), and using my own batch gradient descent (BGD) algorithm as defined above. Running the BGD algorithm takes my computer up to an hour, so I have done my analysis on the glm outputs instead, but I have left the BGD code in the function. To run it on BGD, please uncomment lines 1524-1526, and comment out lines 1528-1552.
kfoldcv <- function(k, data) {
# shuffle data
set.seed(42)
rows <- sample(nrow(data))
shuff_data <- data[rows, ]
N <- nrow(shuff_data) # total rows to split into folds (use the function argument, not the global df3)
Min_index <- 0
folds <- list()
# create folds
for(i in 1:k) {
if(i==k) {
Max_index <- (floor(N/k))*i+(N%%k) # For last fold make the size as the proportion N/k plus the remainder
}
else {
Max_index <- (floor(N/k))*i
}
folds[i] <- list(shuff_data[(Min_index+1):Max_index,]) # Append the fold to the list
Min_index <- Max_index # Change the starting point for next loop
}
Kresults <- setNames(data.frame(matrix(ncol = 4)), c("accuracy", "precision", "recall", "fscore")) # Store the results
for(j in 1:k) { # For each fold
test <- data.frame(folds[j]) # get one test chunk
train <- list()
for(i in 1:k) {
if(i!=j) { # Avoid binding the current test fold
train <- rbind(train, data.frame(folds[i])) # attach other folds to be train set
}
}
# the below three commented lines can be used to run the gradient descent function defined in Q3.3
# fit <- executeLogReg(train[,-10], train[,10], 1000, 0.1, 0.1,"bgd")
# preds <- evalRegModel(test[,-10], test[,10], fit$f_coeffs)
# Kresults <- rbind(Kresults, c(preds$acc, preds$prec, preds$sen, preds$fscore))
# *************************** Using glm ***************************
fit <- glm(Liver.Patient ~ ., data=train, family="binomial" (link='logit'))
# predict the targets
probs <- data.frame(predict(fit, test, type="response"))
preds <- ifelse(probs > 0.5,1,0)
output <- cbind(test, preds)
# create confusion matrix, converting targets and predictions to factors
tab <- table(factor(output[,11], levels=min(test):max(test)),
factor(output[,10], levels=min(test):max(test)))
# calculate % accuracy = (TP + TN) / (TP + TN + FP + FN)
acc <- sum(diag(tab)/(sum(rowSums(tab)))) * 100
# sensitivity AKA recall
sen <- round(tab[1,1]/sum(tab[1,]), digits=2)
# precision
prec <- round(tab[1,1]/sum(tab[,1]), digits=2)
# f-score
fscore <- round(2*((prec * sen)/(prec + sen)), digits=2)
Kresults <- rbind(Kresults, c(acc, prec, sen, fscore))
# *************************** to here ***************************
}
Kresults <- na.omit(Kresults)
Kresults <- rbind(Kresults, c(mean(Kresults$accuracy),
mean(Kresults$precision),
mean(Kresults$recall),
mean(Kresults$fscore)))
Kresults <- na.omit(Kresults)
final <- tail(Kresults,n=1)
res <- list("Kresults" = Kresults, "final" = final)
return(res)
}
k10 <- kfoldcv(10, df3)
print(k10$final)
## accuracy precision recall fscore
## 111 71.43541 0.275 0.514 0.333
We can see that running 10-fold cross-validation shows the accuracy is actually slightly higher than our initial evaluation in Q3.3. However, the recall and precision are much lower, and the F-score suggests the model is quite underfitted. This is interesting because I expected the model to perform much better on precision and recall.
kAll <- setNames(data.frame(matrix(ncol = 5)), c("K-value", "accuracy", "precision", "recall", "fscore")) # Store the results
k5 <- kfoldcv(5, df3)
row <- cbind("K-value"=5, k5$final)
kAll <- rbind(kAll, row)
row <- cbind("K-value"=10, k10$final)
kAll <- rbind(kAll, row)
k25 <- kfoldcv(25, df3)
row <- cbind("K-value"=25, k25$final)
kAll <- rbind(kAll, row)
k50 <- kfoldcv(50, df3)
row <- cbind("K-value"=50, k50$final)
kAll <- rbind(kAll, row)
k100 <- kfoldcv(100, df3)
row <- cbind("K-value"=100, k100$final)
kAll <- rbind(kAll, row)
k250 <- kfoldcv(250, df3)
row <- cbind("K-value"=250, k250$final)
kAll <- rbind(kAll, row)
k450 <- kfoldcv(450, df3)
row <- cbind("K-value"=450, k450$final)
kAll <- rbind(kAll, row)
kN <- kfoldcv(nrow(df3), df3)
row <- cbind("K-value"=nrow(df3), kN$final)
kAll <- rbind(kAll, row)
kAll <- na.omit(kAll)
kAll
ga <- ggplot(kAll,aes(x=`K-value`,y=accuracy)) +
geom_line() +
geom_point(shape=21, color="black", fill="blue", size=3) +
ggtitle("K-value vs. Accuracy")
gb <- ggplot(kAll,aes(x=`K-value`,y=precision)) +
geom_line() +
geom_point(shape=21, color="black", fill="pink", size=3) +
ggtitle("K-value vs. Precision")
gc <- ggplot(kAll,aes(x=`K-value`,y=recall)) +
geom_line() +
geom_point(shape=21, color="black", fill="purple", size=3) +
ggtitle("K-value vs. Recall")
gd <- ggplot(kAll,aes(x=`K-value`,y=fscore)) +
geom_line() +
geom_point(shape=21, color="black", fill="orange", size=3) +
ggtitle("K-value vs. F-score")
grid.arrange(ga, gb, gc, gd)
We can see that as K increases, our model's performance also increases, up to the point where, when K = N (leave-one-out cross-validation), we have 100% accuracy. This is expected since each model is trained on more data per fold, but I am surprised to see such a difference between the F-score and the accuracy.
Accuracy can be used when the class distribution is balanced, while the F1-score is a better metric when classes are imbalanced. Considering there were many more positive cases in our data than negative, it makes sense that the F-score is the better indicator of model performance, and it appears that our model does not achieve a good F-score until K = 250 onwards. So even though accuracy appears high, especially at small K values, I don't think the model actually performs well until K = 250, when we see a big improvement in the F-score. Of course, K = 450 performs even better and is practically perfect.
There is also a marked improvement in precision and recall as K increases. The model is quite imprecise up to K = 100. This could be because our dataset has many more positive points (1s) than negative points (0s), so the model had more positive examples to learn from. Precision improves drastically at K = 250, jumping from 0.58 to 0.86.
It is interesting to note how good the model's recall is at lower K values: at K = 100 the recall is already very high at 0.86. Again, I think this is because there are many more positive cases in the dataset than negative, and it shows just how much that imbalance affects the recall.
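The point that accuracy can look good under class imbalance while the F-score does not can be illustrated with a small made-up confusion matrix (the counts below are purely illustrative, not from our dataset):

```r
# Made-up counts for an imbalanced problem: 90 actual positives, 10 actual negatives
TP <- 88; FN <- 2  # positives are almost all caught
TN <- 1;  FP <- 9  # negatives are almost all missed
acc <- (TP + TN) / (TP + TN + FP + FN)      # overall accuracy
prec_neg <- TN / (TN + FN)                  # precision for the minority (negative) class
rec_neg  <- TN / (TN + FP)                  # recall for the minority class
f1_neg   <- 2 * prec_neg * rec_neg / (prec_neg + rec_neg)
round(c(accuracy = acc, f1_minority = f1_neg), 3)
```

Here a classifier that predicts "positive" for almost everything scores 89% accuracy, yet its F-score on the minority class is only about 0.15, which mirrors the gap we see between accuracy and F-score at small K values.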
In the gradient descent algorithms, we can apply regularisation terms to the model's weights. This adds a cost to the loss function for large weights (or parameter values), forcing a simpler model that learns only the relevant patterns in the training data.
There are two main regularisation techniques: L1 and L2. L1 regularisation adds a cost proportional to the absolute value of the parameters, which drives some of the weights to exactly zero.
L2 regularisation adds a cost proportional to the squared value of the parameters, resulting in smaller weights; larger coefficients are strongly penalised because of the squaring.
L2 is better suited to data that is very complex and difficult to model accurately, whereas L1 is better for data that is easily modelled. L1 is also robust to outliers and can be used for feature selection, but is slower to converge than L2.
For our gradient descent algorithm, this means adding an extra penalty term to the cost function, and hence to the gradient used when we update the weights.
If we added an L2 regularisation term to our cost function, the variance would be reduced, i.e. the coefficients shrink in magnitude.
If we added an L1 regularisation term, the variance would also be reduced, and the penalty would additionally act as a feature selector, which could leave a different number of non-zero coefficients in our model.
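As a sketch of what this change looks like in a BGD step (the function and variable names below are illustrative, not the actual executeLogReg() internals), the penalty term is simply added to the unregularised gradient before the weight update:

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# One regularised batch-gradient-descent update for logistic regression.
# X: n x p feature matrix, y: 0/1 targets, w: current weights,
# alpha: learning rate, lambda: regularisation strength.
# (In practice the intercept term is usually left unpenalised.)
reg_update <- function(w, X, y, alpha, lambda, type = c("l2", "l1")) {
  type <- match.arg(type)
  n <- nrow(X)
  grad <- t(X) %*% (sigmoid(X %*% w) - y) / n   # unregularised gradient
  penalty <- switch(type,
                    l2 = lambda * w,            # derivative of (lambda/2) * ||w||^2
                    l1 = lambda * sign(w))      # subgradient of lambda * ||w||_1
  drop(w - alpha * (grad + penalty))            # gradient step with penalty added
}
```

With lambda = 0 this reduces to the plain BGD update; with lambda > 0 the L2 branch multiplies each weight by a shrinkage factor every step, while the L1 branch pushes each weight towards zero by a constant amount.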
Assuming our cost function is \(f(x)\), then we are trying to calculate \(\min_x f(x)\). When adding regularisation, our cost function now looks like \(\min_x f(x) + \lambda g(x)\), where \(\lambda\) is a hyperparameter that helps control the relative importance of the regulariser function \(g(x)\).
An interesting point I came across in my research concerns where the optimum actually sits. When we minimise a regularised cost function, the question is whether the optimal weights can reach exactly 0, and the answer differs between L1 and L2.
An L2-regularised cost function is smooth, so the optimum is at a stationary point (where the derivative = 0). The minimiser can get very small when \(\lambda\) is increased, but it will not equal 0 exactly.
An L1-regularised cost function is non-smooth, meaning it may not be differentiable at 0. We usually assume the optimum of a function is at a point where the derivative = 0, but a non-smooth function (such as \(y=|x|\)) can attain its minimum at a corner (an irregularity), so it is possible for the optimal point to be exactly 0 even though 0 is not a stationary point of \(f\).
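This can be checked numerically in one dimension. Taking \(f(x)=(x-a)^2\) with \(a=1\) and a fairly large \(\lambda\) (both values chosen just for illustration), the L2 minimiser shrinks towards 0 but never reaches it, while the L1 minimiser lands exactly on the corner at 0:

```r
a <- 1
lambda <- 3
grid <- seq(-2, 2, by = 1e-4)  # fine grid to locate the minimisers numerically
# L2: smooth objective, minimiser at the stationary point a / (1 + lambda) = 0.25
l2_opt <- grid[which.min((grid - a)^2 + lambda * grid^2)]
# L1: non-smooth objective, minimiser sits on the corner at exactly 0
l1_opt <- grid[which.min((grid - a)^2 + lambda * abs(grid))]
c(L2 = l2_opt, L1 = l1_opt)
```

Shrinking \(\lambda\) below \(2|a|\) moves the L1 minimiser off the corner again, which is exactly the feature-selection behaviour described above: past a certain penalty strength, the coefficient is switched off entirely.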